Abstract

Motivated by the prevalence of high dimensional low sample size datasets in modern
statistical applications, we propose a general nonparametric framework,
Direction-Projection-Permutation (DiProPerm), for testing high dimensional hypotheses. The
method is aimed at rigorous testing of whether lower dimensional visual differences are
statistically significant. Theoretical analysis under the non-classical asymptotic regime
of dimension going to infinity for fixed sample size reveals that certain natural
variations of DiProPerm can have very different behaviors. An empirical power study both
confirms the theoretical results and suggests DiProPerm is a powerful test in many
settings. Finally, DiProPerm is applied to a high dimensional gene expression dataset.

Keywords: binary classification; Maximal Data Piling; permutation test; Support Vector Machine;
two-sample problem.

1 Introduction 

We propose a nonparametric procedure for testing high dimensional hypotheses that is 
especially practical in high dimensional low sample size (HDLSS) settings. HDLSS data sets 
arise in many modern applications of statistics, including genetics, chemometrics, and image 
analysis. An intuitive approach to looking for differences between two high dimensional 
distributions is by looking for differences between their one dimensional projections onto 
some appropriate direction. DiProPerm is a three-step procedure based on this idea.
The procedure is as follows: 

1. Direction — take the normal vector to the separating hyperplane of a binary linear 
classifier trained on the class labels. 

2. Projection — project data from both samples onto this direction and calculate a 
univariate two-sample statistic. An illustration of this can be seen in the first panel 
of Figure 1.

3. Permutation — assess the significance of this univariate statistic by a permutation 
test. Namely, (a) pool the two samples and permute the class labels; (b) take the 
normal vector to the binary linear classifier retrained on the permuted class labels; 
(c) project data onto this direction and re-calculate the univariate two-sample statistic. 
An illustration of this can be seen in the last three panels of Figure 1. For a level $\alpha$
test, we reject the null if the original test statistic is among the $100\alpha\%$ largest of the
permuted statistics.
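
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it swaps in the Mean Difference direction (the vector joining the sample centroids) for the trained classifier in step 1, uses the difference of projected means as the univariate statistic, and draws random relabellings instead of enumerating all of them. The function names, the `n_perm` default, and the "+1" convention in the empirical p-value are assumptions made for the sketch.

```python
import numpy as np

def md_direction(X, Y):
    """Unit vector joining the two sample centroids (stand-in for a classifier normal)."""
    v = X.mean(axis=0) - Y.mean(axis=0)
    return v / np.linalg.norm(v)

def md_statistic(px, py):
    """Difference of the projected sample means."""
    return px.mean() - py.mean()

def diproperm(X, Y, direction=md_direction, statistic=md_statistic,
              n_perm=1000, rng=None):
    """Return (observed statistic, permuted statistics, empirical p-value)."""
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    Z = np.vstack([X, Y])                      # pooled sample

    def one_run(A, B):
        w = direction(A, B)                    # step 1: direction
        return statistic(A @ w, B @ w)         # step 2: projection and statistic

    observed = one_run(X, Y)
    permuted = np.empty(n_perm)
    for b in range(n_perm):                    # step 3: permutation
        idx = rng.permutation(Z.shape[0])      # random relabelling of the pooled sample
        permuted[b] = one_run(Z[idx[:m]], Z[idx[m:]])
    # Share of statistics at least as large as the observed one (assumed convention:
    # the observed labelling is counted once in numerator and denominator).
    p_value = (1 + np.sum(permuted >= observed)) / (1 + n_perm)
    return observed, permuted, p_value

# Two samples from the same 1000-variate standard Gaussian, as in the motivating example.
data_rng = np.random.default_rng(0)
X = data_rng.standard_normal((50, 1000))
Y = data_rng.standard_normal((50, 1000))
obs, perm, p = diproperm(X, Y, n_perm=100, rng=1)
```

Passing the direction and the statistic as functions mirrors the point that DiProPerm is a framework: any other direction (for example a classifier normal) or two-sample statistic could be supplied in their place.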

DiProPerm is not a single test but a general hypothesis testing framework. The number of 
combinations of direction and univariate statistic is large. We will focus on a select few in 
this paper, but more options are discussed in detail in Section C of the Supplement.

In general, we are interested in testing the hypotheses: 1) equality of two distributions
and 2) equality of means. That any DiProPerm test is an exact level $\alpha$ test for equality
of distributions follows immediately from general permutation test theory. A perhaps
surprising point is that for testing equality of means, validity does not hold for some natural
versions of DiProPerm. In this paper we study the theoretical properties of two particular
DiProPerm tests. We will show that one is valid for testing equality of means while the
other is not.

1.1 A Motivating Example 

Lower dimensional projections in directions of interest are often used to understand
structure in high dimensional data. One example is the directions found by applying Principal
Component Analysis (PCA), see Jolliffe (2002) for an excellent introduction, which yields
directions maximizing variation. When there are two classes however, as in the case we are
studying, additional insights come from directions based on binary linear classifiers, where
a binary classification decision is based on the value of a linear combination of the data
features.

In very high dimensions many linear classifiers over-fit. Here is a simple example
illustrating this. Draw two independent samples, each of size 50, from the 1000-variate standard
Gaussian distribution. We use the Distance Weighted Discrimination (DWD) direction in
step 1 of DiProPerm (Marron et al., 2007). DWD is a binary linear classifier similar to the
Support Vector Machine (SVM) with certain advantages in high dimensions; see Cortes and
Vapnik (1995) for an introduction to SVM.

The first panel of Figure 1 shows the one dimensional projection of the data onto the
DWD direction trained on the original class labels. Colors are used to represent original
class membership and are thus constant throughout the first three panels. The projections
are jittered on the y-axis to allow easy visualization. A kernel density estimate of the
projections is plotted in the background (solid black line). We see that the projections in
the first panel of Figure 1 are very well separated despite the fact that the samples arise
from the same underlying distribution. This clear over-fitting artifact, common in HDLSS
data, is a strong motivation for DiProPerm.

The middle two panels of Figure 1 show projections of the data onto re-trained DWD
directions, each based on a realization of randomly permuted class labels. Symbols are
used to represent permuted class labels and are thus different in the first three panels. We
find the projections here to be well separated with respect to the symbols. Relative to
the second and third panels, the original separation we observed in the first panel is quite
unremarkable, suggesting that the two underlying distributions are not different.

The last panel in Figure 1 confirms this observation. We perform a DiProPerm test
with 100 permutations and display the statistic, chosen here to be the difference of sample
means, calculated for each permutation. The vertical line is the original statistic calculated
on the unpermuted data. We see that based on the DiProPerm test, the null hypothesis of
equal distributions should not be rejected.

Figure 1: The data are standard 1000-variate Gaussian. In the first panel, the DWD
direction is trained on the original class labels, represented by colors (same in all panels).
In the second and third panels, the DWD directions are trained on realizations of randomly
permuted class labels, represented by symbols (different in each panel). The separation
in the first panel is comparable to that in the second and third panels. One hundred
permutation statistics resulting from a DiProPerm test are shown in the last panel, which
confirms the separation in the first panel is not significant.



1.2 The Hypotheses 

Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent random samples of $\mathbb{R}^d$-valued random
vectors, $d \geq 1$, with distributions $F_1$ and $F_2$, respectively. We are interested in testing the null
hypothesis of equality of distributions

$$H_0: F_1 = F_2 \quad \text{versus} \quad H_1: F_1 \neq F_2. \tag{1}$$

Let $\mu(F)$ denote the mean of a distribution $F$. Another item of interest is to test the weaker
null hypothesis of equality of means

$$H_0: \mu(F_1) = \mu(F_2) \quad \text{versus} \quad H_1: \mu(F_1) \neq \mu(F_2). \tag{2}$$

Note that the multivariate Behrens-Fisher problem concerns testing (2) under normality.

1.3 Overview 

The outline for the paper is as follows. A review of related work is presented in Section 2.
In Section 3, two DiProPerm tests are closely examined. HDLSS asymptotics are used to
investigate the validity of these two tests for the weaker null hypothesis of equality of means
in Section 4. In Section 5, we perform a Monte Carlo power study comparing DiProPerm
to other methods. Finally, in Section 6, DiProPerm is applied to a real microarray dataset.



2 Related work 

There is extensive literature on testing equality of distributions for two multivariate dis- 
tributions under the classical setting of sample size larger than dimension. For the more 
challenging HDLSS setting, several methods have been developed and we discuss two of 
them here. 



First, there are nearest neighbor tests (Bickel and Breiman, 1983; Henze, 1988; Schilling,
1986), which are based on nearest neighbor coincidences: the number of neighbors around
a data point that belong to the same sample. The null distribution of the test statistic can
be derived parametrically using normal theory or nonparametrically using a permutation
test. A more recent contribution to testing equality of distributions under HDLSS settings
is Szekely and Rizzo's nonparametric energy test (Szekely and Rizzo, 2004). The energy
test statistic is based on the Euclidean distance between pairs of sample elements. Here
significance is assessed through permutation testing.

The nearest neighbor test and the energy test require calculation of all pairwise distances
between sample elements. The computational complexity of both tests is independent of
dimension, so both are suitable for the HDLSS setting. In Section 5.1 we perform an
empirical power study comparing DiProPerm to the energy test.

For testing equality of means for two multivariate distributions, the classical Hotelling
$T^2$ test is often used in the setting of sample size larger than dimension. However, the
Hotelling $T^2$ statistic is not computable in HDLSS situations because the sample covariance
matrix is not of full rank. To address this issue, the methods in Bai and Saranadasa (1996),
Chen and Qin (2010), and Srivastava and Du (2008) replace the covariance matrix in the
Hotelling $T^2$ statistic by a diagonalized version.

Taking a different approach, the method proposed by Lopes et al. projects the high
dimensional data onto a random subspace of low enough dimension so that the traditional
Hotelling $T^2$ statistic may be used (Lopes et al., 2011). All of these tests have the
disadvantage that equal covariances are assumed, which is not a restriction we place on DiProPerm.
In Section 5.2 we perform an empirical power study comparing DiProPerm to the Random
Projection test proposed in Lopes et al. (2011).



3 The Choice of The Univariate Statistic 

Here, we study the difference between two particular choices of the univariate statistic 
in Step 2 of DiProPerm. First, let the Mean Difference (MD) direction be the vector 
connecting the centroids of each sample. For simplicity, we will use this particular direction 
to compare two natural statistics of the projections: 1) the Mean Difference (MD) statistic,
the difference of sample means, and 2) the two-sample t-statistic (t), the difference of
sample means divided by $\{s_1^2/m + s_2^2/n\}^{1/2}$, where $s_1$ and $s_2$ are the sample standard deviations
of each class, sized m and n respectively. Henceforth we specify different DiProPerm tests
by concatenating the direction name and two sample univariate statistic name. Following 
this convention, the DiProPerm test that uses the MD direction and the MD statistic will 
be referred to as the MD-MD test and the DiProPerm test that uses the MD direction 
and the two-sample t statistic as the MD-t test. 

We provide a toy example to contrast the MD and t statistics.
We draw independent samples, each of size 50, where the first sample arises from the
1000-variate standard Gaussian distribution and the second from the 1000-variate distribution with
iid marginal t(5) distributions. Note that the samples arise from different distributions
that have the same means. Figure 2 shows the one dimensional projection of the data onto
various MD directions and the MD and t statistics applied to these projections.

The lengths of the longer horizontal black bars represent the MD statistics while the
lengths of the shorter horizontal bars represent the sample standard deviations of the
projected data in each permuted group. The MD statistic and two-sample t statistic calculated
on the projected data are displayed towards the top of each panel. We see that the
t-statistic in the first panel is much higher than the permuted t-statistics in the second and
third panels. On the other hand, the MD statistic is about the same between the original
and permuted worlds. We confirm this is a systematic pattern by looking at 1000
permutations and calculating the MD-t statistic. The distribution of the permuted MD-t statistics
can be seen in the last panel of Figure 2. We see that the original MD-t statistic,
represented as a vertical line, is among the larger permutation statistics, leading us to reject the
null hypothesis. The distribution of the MD-MD permutation statistics, not shown here,
looks very similar to the last panel of Figure 1, where the original statistic is close to the
middle of the permutation distribution. Thus under this setting the MD-t test rejects the
null while the MD-MD does not.

This apparent inconsistency is due to the fact that the MD and t statistics are actually
testing different hypotheses. The former is testing the weak hypothesis of equality of means
while the latter is testing the strong hypothesis of equality of distributions. In light of this,
each test is correct in its decision. This phenomenon is studied in detail in the next section.

Figure 2: The first sample arises from a standard 1000-variate Gaussian distribution and
the second sample arises from the 1000-variate distribution with iid t(5) marginals. In the
first panel, the MD direction is trained on the original class labels, represented by colors.
In subsequent panels, the MD direction is trained on realizations of permuted class labels,
represented by symbols. Note that the MD-MD statistic is similar across the first three
panels while the MD-t statistic is much larger in the first panel. One thousand permutation
MD-t statistics are shown in the last panel. The empirical p-value is small, suggesting the
test that uses MD-t would reject the null.



4 Hypothesis Test Validity 

In this section, we study the validity of the MD-MD and the MD-t for testing 1) equality
of distributions and 2) equality of means. We work with the MD direction because it is
most amenable to theoretical analysis. Future work will include other directions such as
DWD and SVM. The high dimensional geometric representation of SVM and DWD described
in Bolivar-Cime and Marron (2013) could provide the basis for this endeavor.



That both the MD-MD and the MD-t are exact tests for equality of distributions follows
from standard theory on permutation tests. We will discuss how an exact level $\alpha$ test can
be constructed by a permutation test. Let $N = m + n$ and write $Z = (Z_1, \ldots, Z_N) =
(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ for the pooled sample. Let $\{\pi(1), \ldots, \pi(N)\}$ be a permutation of
$\{1, \ldots, N\}$. Write $Z_\pi = (Z_{\pi(1)}, \ldots, Z_{\pi(N)})$ for the permuted sample. Let $G_N$ denote the set
of all permutations $\pi$ of $\{1, \ldots, N\}$. Then for any test statistic $V_{m,n} = V_{m,n}(Z_1, \ldots, Z_N)$,
we can calculate $V_{m,n}(Z_{\pi(1)}, \ldots, Z_{\pi(N)})$ for all $\pi \in G_N$. The test that rejects the null if
the original statistic $V_{m,n}(Z_1, \ldots, Z_N)$ is larger than $(1 - \alpha)100\%$ of the permuted statistics
$V_{m,n}(Z_{\pi(1)}, \ldots, Z_{\pi(N)})$ is an exact level $\alpha$ test. The exactness comes from the fact that the
unconditional distribution and the permutation distribution of the statistic coincide under
the null of equal distributions. It follows that the MD-MD test and the MD-t test, and any
other DiProPerm test, are exact for testing equality of distributions.
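
For very small samples the full permutation distribution can be enumerated rather than sampled. The sketch below is an illustration under stated assumptions: it uses univariate toy data and the difference of sample means as the statistic V, and it enumerates the splits of the pooled sample into groups of size m and n instead of all N! permutations, which gives the same distribution for any statistic that does not depend on the order within each group. The function names and the toy numbers are hypothetical.

```python
from itertools import combinations

def mean_difference(xs, ys):
    """Difference of sample means, used here as the statistic V."""
    return sum(xs) / len(xs) - sum(ys) / len(ys)

def exact_permutation_pvalue(xs, ys, statistic=mean_difference):
    """Enumerate every split of the pooled sample into groups of size m and n."""
    pooled = list(xs) + list(ys)
    m, N = len(xs), len(pooled)
    observed = statistic(xs, ys)
    values = []
    for first in combinations(range(N), m):
        chosen = set(first)
        a = [pooled[i] for i in first]                       # relabelled first sample
        b = [pooled[i] for i in range(N) if i not in chosen]  # relabelled second sample
        values.append(statistic(a, b))
    # Proportion of relabellings whose statistic is at least the observed one.
    p_value = sum(v >= observed for v in values) / len(values)
    return observed, values, p_value

# Toy data: every value in the first sample exceeds every value in the second.
xs = [3.1, 2.7, 3.4]
ys = [1.0, 1.2, 0.8]
observed, values, p_value = exact_permutation_pvalue(xs, ys)
```

With three observations per group there are 20 splits, so the smallest attainable p-value is 1/20; this is the sense in which the attainable levels of an exact permutation test are limited by the sample sizes.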

Establishing validity for testing equality of means, on the other hand, is not as
straightforward. In general, permutation tests cannot be expected to be valid for testing
weaker hypotheses such as equality of means. For instance, if the covariances are not the 
same, we have to be very careful with our choice of direction and two-sample statistic. The 
signal in the covariances may confound our interpretation of tests that are sensitive to both 
the signal in the mean and the signal in the variances. This is consistent with our results 
which show that under normality and balanced sample sizes, the MD-MD remains valid for 
testing equality of means under heterogeneous covariances. On the other hand, the MD-t 
is invalid when the covariances are not the same. 

4.1 MD-MD 

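In this section, we establish that the MD-MD test is an exact test for equality of means
under normality and balanced sample sizes. The MD-MD test statistic, $T_{m,n}(Z)$, is the
mean of the projections of the X's onto the unit vector in the direction of $\bar{X} - \bar{Y}$ minus the
mean of the projections of the Y's onto the unit vector in the direction of $\bar{X} - \bar{Y}$:

$$\begin{aligned}
T_{m,n}(Z) &= \frac{1}{m}\sum_{i=1}^{m} X_i \cdot \frac{\bar{X} - \bar{Y}}{\|\bar{X} - \bar{Y}\|} - \frac{1}{n}\sum_{j=1}^{n} Y_j \cdot \frac{\bar{X} - \bar{Y}}{\|\bar{X} - \bar{Y}\|} \\
&= \|\bar{X} - \bar{Y}\|.
\end{aligned} \tag{5}$$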

Theorem 1. Let $X_1, \ldots, X_m$ be an iid sample from the d-variate Gaussian distribution
$N(\mu_X, \Sigma_X)$ and $Y_1, \ldots, Y_n$ be an independent sample drawn iid from the d-variate Gaussian
distribution $N(\mu_Y, \Sigma_Y)$ where $\Sigma_X \neq \Sigma_Y$. If $m = n$ then the unconditional distribution and
the permutation distribution of $T_{m,n}(Z)$ are equal under the null $\mu_X = \mu_Y$.

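Proof. Under $\mu_X = \mu_Y$, $\bar{X} - \bar{Y}$ is distributed as

$$N(0, \Sigma_X/m + \Sigma_Y/n) \tag{6}$$

and the permutation distribution of $\bar{X} - \bar{Y}$ is the mixture

$$\sum_{r=0}^{m} \frac{\binom{m}{r}\binom{n}{r}}{\binom{N}{m}} \, N\left(0, \frac{(m - r)\Sigma_X + r\Sigma_Y}{m^2} + \frac{r\Sigma_X + (n - r)\Sigma_Y}{n^2}\right), \tag{7}$$

where $N = m + n$ in the binomial coefficient and $r$ counts the observations exchanged between
the two samples. If $m = n$, every component covariance in (7) reduces to $(\Sigma_X + \Sigma_Y)/m$, so
the expressions in (6) and (7) are the same, in which case the unconditional and
permutation distributions of $T_{m,n}(Z)$ are also the same. □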
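
The balanced-sample argument can be checked numerically. The sketch below assumes that, when r observations are exchanged between the samples, the permuted mean difference has covariance ((m - r)Σ_X + rΣ_Y)/m² + (rΣ_X + (n - r)Σ_Y)/n², which is the reading of (7) used here; the helper name and the randomly generated covariance matrices are hypothetical. It confirms that every such covariance equals Σ_X/m + Σ_Y/n when m = n, and that this fails for unbalanced sizes.

```python
import numpy as np

def permuted_component_cov(r, m, n, Sx, Sy):
    """Covariance of the permuted mean difference when r observations are swapped."""
    group1 = ((m - r) * Sx + r * Sy) / m**2    # permuted first-sample mean
    group2 = (r * Sx + (n - r) * Sy) / n**2    # permuted second-sample mean
    return group1 + group2

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
Sx, Sy = A @ A.T, B @ B.T                      # two arbitrary, unequal covariances

# Balanced case: every component matches the unconditional covariance in (6).
m = n = 6
target = Sx / m + Sy / n
balanced_ok = all(
    np.allclose(permuted_component_cov(r, m, n, Sx, Sy), target)
    for r in range(m + 1)
)

# Unbalanced case: components with r > 0 no longer match.
m2, n2 = 6, 12
target2 = Sx / m2 + Sy / n2
unbalanced_ok = all(
    np.allclose(permuted_component_cov(r, m2, n2, Sx, Sy), target2)
    for r in range(m2 + 1)
)
```

Here `balanced_ok` comes out true and `unbalanced_ok` false, in line with the role that m = n plays in Theorem 1.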

4.2 MD-t 

The MD-t statistic, denoted by $U_{m,n}(Z)$, is the result of applying the unbalanced sample
sizes, unequal variance two-sample t-test statistic (also known as Welch's t-test (Welch,
1947)) to the projections onto the MD direction. Let $a \cdot b$ denote the standard dot product
between two vectors in $\mathbb{R}^d$. The sample variances of the projected data can be expressed as

$$s_x^2 = \frac{1}{m - 1}\sum_{i=1}^{m} \left[(X_i - \bar{X}) \cdot (\bar{X} - \bar{Y})\right]^2$$

and

$$s_y^2 = \frac{1}{n - 1}\sum_{j=1}^{n} \left[(Y_j - \bar{Y}) \cdot (\bar{X} - \bar{Y})\right]^2.$$

Define $S_{m,n}(Z) = S_{m,n}(X_1, \ldots, X_m, Y_1, \ldots, Y_n) = s_x^2/m + s_y^2/n$. The MD-t statistic is

$$U_{m,n}(Z) = U_{m,n}(X_1, \ldots, X_m, Y_1, \ldots, Y_n) = T_{m,n}(Z) / \{S_{m,n}(Z)\}^{1/2},$$

where $T_{m,n}(Z)$ is as in Section 4.1. We use the term "projected" rather loosely here since
we have not normalized by $\|\bar{X} - \bar{Y}\|$. This is of no actual consequence since the two-sample
t-statistic is scale invariant.
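
The scale invariance mentioned above is easy to confirm numerically. The following sketch applies Welch's statistic to projections onto an unnormalized, optionally rescaled MD direction; the function name, the `scale` argument, and the use of the n - 1 divisor for the sample variances are assumptions of the sketch. Note that both the numerator and the variances are computed from the same projections, so a positive rescaling of the direction cancels.

```python
import numpy as np

def md_t_statistic(X, Y, scale=1.0):
    """Welch two-sample t statistic of the projections onto the MD direction.

    `scale` rescales the (unnormalized) direction; any positive value
    should give the same statistic.
    """
    v = scale * (X.mean(axis=0) - Y.mean(axis=0))   # unnormalized MD direction
    px, py = X @ v, Y @ v                            # one dimensional projections
    m, n = len(px), len(py)
    s2x = px.var(ddof=1)                             # sample variance, divisor m - 1
    s2y = py.var(ddof=1)                             # sample variance, divisor n - 1
    return (px.mean() - py.mean()) / np.sqrt(s2x / m + s2y / n)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
Y = rng.standard_t(df=5, size=(50, 200))             # iid t(5) marginals
u_raw = md_t_statistic(X, Y)
u_scaled = md_t_statistic(X, Y, scale=7.3)
```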

Under equal means the numerator in the MD-t statistic behaves similarly in the permu- 
tation world and the original world. However, we will see that the denominator of the MD-t 
statistic has very different behavior. We find that the denominator of the MD-t is larger in 
the permuted world, as seen in Figure 2. This has the effect of making the unconditional
distribution of the MD-t statistic larger than the permutation distribution. 

To gain some intuition, consider the following toy HDLSS example. Suppose we observe
$X_1, X_2 \sim F_1$ and $Y_1, Y_2 \sim F_2$ where $F_1 = N(0, I_d)$ and $F_2 = N(0, \sigma^2 I_d)$, $\sigma^2 \neq 1$. The
points $X_1, X_2, Y_1, Y_2$ form the vertices of a tetrahedron in three dimensional space. The
two-dimensional plane generated by $Y_1$, $Y_2$ and $\bar{X}$ is shown in Figure 3. Distances between
elements of interest are calculated using standard HDLSS asymptotics; see Hall et al. (2005)
for examples of this type of calculation. All distances have an additional $O_p(1)$ term that
is not shown to avoid clutter. The geometric configuration in Figure 3 has the implication
that $s_y^2$ is small. To see this, note that the projections of $Y_1$ and $Y_2$ onto the MD direction
$\bar{X} - \bar{Y}$ are close to the projection of $\bar{Y}$ itself. A similar argument can be applied to show $s_x^2$
is small.




Figure 3: Plane generated by $Y_1$, $Y_2$ and $\bar{X}$ where $X_1, X_2 \sim F_1 = N(0, I_d)$ and $Y_1, Y_2 \sim
F_2 = N(0, \sigma^2 I_d)$ for $\sigma^2 \neq 1$. Note that the projections of $Y_1$ and $Y_2$ onto $\bar{X} - \bar{Y}$ are close to
the projection of $\bar{Y}$ onto $\bar{X} - \bar{Y}$. This has the implication that $s_y^2$ will be small.



Now let's look at what happens in the permutation world. Figure 4 shows the
two-dimensional plane generated by the realization of a random permutation where $X_1^* = X_2$,
$X_2^* = Y_2$, $Y_1^* = X_1$ and $Y_2^* = Y_1$. Notice that the distance between $Y_1^*$ and $\bar{X}^*$ is
different from the distance between $Y_2^*$ and $\bar{X}^*$. This has the effect of making $s_{y^*}^2$, the
sample variance of the projections of $Y_1^*$ and $Y_2^*$, large. To see this, note that the projections
of $Y_1^*$ and $Y_2^*$ onto the permuted MD direction are not close to the projection of $\bar{Y}^*$. A
similar argument can be applied to show $s_{x^*}^2$, the sample variance of the projections of $X_1^*$
and $X_2^*$, is large. The derivations for the distances shown in Figures 3 and 4 can be found
in the supplement.

The toy example above suggests the denominator of the MD-t statistic is larger in the
permutation world than in the original world. The next result gives us a sense of just how
far apart the permutation and unconditional distributions of $S_{m,n}(Z)$ are.

Figure 4: Plane generated by a particular permutation realization of $X_1, X_2, Y_1$, and $Y_2$.
Note that the projections of $Y_1^*$ and $Y_2^*$ onto $\bar{X}^* - \bar{Y}^*$ are not close to the projection of $\bar{Y}^*$
onto $\bar{X}^* - \bar{Y}^*$. This has the implication that $s_{y^*}^2$ may be large.

Theorem 2. Let $X_1, \ldots, X_m$ be a sample from the d-variate Gaussian distribution $N(\mu_x, \sigma_x^2 I_d)$
and $Y_1, \ldots, Y_n$ be an independent sample from the d-variate Gaussian distribution $N(\mu_y, \sigma_y^2 I_d)$,
where $\sigma_x^2 \neq \sigma_y^2$ are scalars. Under $\mu_x = \mu_y$, we have

$$\frac{1}{d} S_{m,n}(Z) \to \left(\frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n}\right)\left[\frac{\sigma_x^2}{m}\,\frac{1}{m - 1}\,\chi^2(m - 1) + \frac{\sigma_y^2}{n}\,\frac{1}{n - 1}\,\chi^2(n - 1)\right] \quad \text{in distribution}$$

as d goes to infinity. For the permuted version, we have for some non-zero constant c,

$$\frac{1}{d^2} S_{m,n}(Z_\pi) \to c \quad \text{in probability.}$$

The results of this theorem are surprising in that the denominator of the MD-t statistic
is actually of different orders in the unconditional and permutation worlds. In particular,
in the unconditional world $S_{m,n}(Z)$ grows like a random variable times $d$, while in the
permutation world it grows like a constant times $d^2$.

Let us revisit the toy example earlier and see what Theorem 2 can tell us. We make
50 draws from $F_1 = N(0, I_d)$ and another 50 independent draws from $F_2 = N(0, 100 I_d)$.
We show in Figure 5, using 1000 Monte Carlo realizations, the simulated permutation and
unconditional distributions of the MD-t statistic for various dimensions.

Figure 5: The unconditional and permutation distribution of the MD-t statistic for the
distributions $F_1 = N(0, I_d)$ and $F_2 = N(0, 100 I_d)$. The separation between the unconditional
and permutation distribution increases with dimension.

Under the conditions in Theorem 2, when $\mu_x = \mu_y$, the numerator of the MD-t statistic
is proportional to a $\chi^2(d)$ variable for both the unconditional and permutation distribution.
On the other hand, by the results in Theorem 2, $\{S_{m,n}(Z)\}^{1/2}$ is of the order $\sqrt{d}$ and $d$ for the
unconditional and permuted distributions, respectively. Thus we should expect the MD-t
statistic to be of the order $\sqrt{d}$ in the original unconditional world and 1 in the permutation
world. This is consistent with Figure 5: the unconditional distribution is centered around
$\sqrt{d}$ while the permutation distribution is not growing with d. As Figure 5 illustrates, the
unconditional distribution quickly separates from the permutation distribution as dimension
increases. Thus it is very important that the MD-t statistic not be used when the goal is
to test for equality of means. On the other hand, this shows the MD-t test has some power
for testing equality of distributions against equal means alternatives.
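
The different growth rates of the denominator can be seen in a small simulation. The sketch below draws 50 observations from N(0, I_d) and 50 from N(0, 100 I_d), computes S = s_x²/m + s_y²/n from projections onto the unnormalized MD direction, and compares it with the average of the same quantity over a few random relabellings. The function names, the number of relabellings, the seed, and the n - 1 variance divisor are assumptions of the sketch, not settings from the study above.

```python
import numpy as np

def projected_variance_term(A, B):
    """S = s_a^2/m + s_b^2/n for projections onto the unnormalized MD direction."""
    v = A.mean(axis=0) - B.mean(axis=0)
    pa, pb = A @ v, B @ v
    return pa.var(ddof=1) / len(pa) + pb.var(ddof=1) / len(pb)

def compare_worlds(d, m=50, n=50, sigma2=100.0, n_perm=20, seed=0):
    """Return S for the original labels and its mean over random relabellings."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, d))
    Y = np.sqrt(sigma2) * rng.standard_normal((n, d))
    Z = np.vstack([X, Y])
    original = projected_variance_term(X, Y)
    permuted = []
    for _ in range(n_perm):
        idx = rng.permutation(m + n)
        permuted.append(projected_variance_term(Z[idx[:m]], Z[idx[m:]]))
    return original, float(np.mean(permuted))

results = {d: compare_worlds(d) for d in (100, 400, 1600)}
for d, (orig, perm) in results.items():
    print(f"d={d:5d}  S/d={orig / d:8.3f}  S_perm/d^2={perm / d**2:8.4f}")
```

If the orders in Theorem 2 hold, the first printed column should fluctuate around a fixed scale while the second settles near a constant, so that the ratio of permuted to original S grows roughly linearly in d.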



4.3 Power surfaces 

In this section, we study the power of the MD-MD and MD-t for testing equality of means.
In the simulations that follow, we make m draws from $F_1 = N(\mu_1, \sigma_1^2 I_d)$, and n independent
draws from $F_2 = N(0, I_d)$. We set d = 500 and m = n = 50 for balanced sample sizes and
m = 50, n = 100 for unbalanced. The dimension d and sample sizes m and n are chosen to
reflect a HDLSS setting. The significance level is set at $\alpha = 0.05$. Power is estimated using
1000 Monte Carlo simulations. Figure 6 displays a 3D surface of power versus $\mu_1$ versus
$\sigma_1^2$, using a color spectrum from cool to warm corresponding to the range 0 to 1. We also
show an image underneath the surface where each pixel corresponds to the point in the 3D
surface above.

Figure 6a displays the estimated power surface of MD-MD under balanced sample sizes.
By Theorem 1, MD-MD is an exact test for equality of means under balanced sample sizes
and normality. This is consistent with what we see in Figure 6a: when the means are equal
(i.e. $\mu_1 = 0$), the power is around $\alpha = 0.05$, as indicated by the streak at $\mu_1 = 0$. When
sample sizes are unbalanced, see Figure 11 in Section B of the Supplement, the MD-MD is
no longer an exact test and may not even be asymptotically valid as $d \to \infty$. In Section B
of the Supplement, we propose a modification of MD-MD that should be used when sample
sizes are unbalanced.




Figure 6: Power surfaces for testing equality of means of the distributions $F_1 = N(\mu_1, \sigma_1^2 I_d)$
and $F_2 = N(0, I_d)$. Panels: (a) MD-MD: Balanced; (b) MD-t: Balanced; (c) MD-t: Unbalanced.
We see that MD-MD attains the correct level under balanced sample
sizes. The MD-t test is not valid for testing equality of means regardless of balanced or
unbalanced sample sizes.

Figures 6b and 6c show that under heterogeneous covariances (when $\sigma_1^2 \neq 1$), the MD-t
test of equal means does not attain the correct level for either balanced or unbalanced
sample sizes. In the immediate region around $(\mu_1, \sigma_1^2) = (0, 1)$, the power of the MD-t
test is close to $\alpha$ as expected. However, as we move away from $(\mu_1, \sigma_1^2) = (0, 1)$, the power
quickly increases. Thus if we use the MD-t test for equality of means, we will reject too
often. On the other hand, this shows that the MD-t test has some power for testing equality
of distributions against alternatives where the means are equal but the distributions are
not.



5 Comparison with Other Methods 

In this section we compare DiProPerm to other methods in the simulation contexts described
in Table 1. First, for testing equality of distributions, we compare the DiProPerm tests
DWD-t and MD-t to the energy test proposed by Szekely and Rizzo (Szekely and Rizzo,
2004). Next, for testing equality of means, we compare the DiProPerm tests DWD-MD and
MD-MD to the Random Projection test proposed by Lopes, Jacob and Wainwright (Lopes
et al., 2011). Our simulation results show that no test is universally most powerful. As
such, our goal is to learn general lessons about the situations under which each method can
be expected to do well.



Simulation   Sample 1                           Sample 2
S1           $N(0, I_d)$                        $t(5)^d$
S2           $N(0, \Sigma_B)$                   $N(\mu, \Sigma_B)$
S3           $N([3, 30, 0, \ldots, 0], I_d)$    $N([-3, 30, 0, \ldots, 0], I_d)$
             $N([3, -30, 0, \ldots, 0], I_d)$   $N([-3, -30, 0, \ldots, 0], I_d)$

Table 1: Simulation settings. The notation $N(\mu, \Sigma)$ denotes a multivariate Gaussian
distribution with mean $\mu$ and covariance $\Sigma$. In S1, the notation $t(5)^d$ denotes the d-variate
distribution with iid marginal distribution $t(5)$. In S2, the first 25% of the coordinates in
$\mu$ are zero and the rest are set to $1/\sqrt{n}$. The covariance matrix $\Sigma_B$ has a block structure
(described further in the text). In S3, each distribution is an equally weighted Gaussian
mixture of the components listed.



Simulation S1 in Table 1 was taken from Szekely and Rizzo. Simulation S2 is a
modification of a simulation found in Lopes, Jacob, and Wainwright. Following their simulation
setting, we let the covariance matrix $\Sigma_B$ be block-diagonal with identical blocks $B \in \mathbb{R}^{5 \times 5}$
along the diagonal. The matrix B has diagonal entries equal to 1 and off-diagonal entries
equal to 0.2. The mean vector is set to the zero vector in sample 1. In the second sample,
the mean vector is set to zero in the first 25% of the coordinates and the rest is set to
$1/\sqrt{n}$. Simulation S3 looks at data arising from equally weighted Gaussian mixtures with
the components listed in Table 1. All DiProPerm tests are implemented using 1000
permutations. Power is estimated through 1000 Monte Carlo simulations at the 0.1 significance level.
In Figures 7 and 8 we display the power against a range of dimensions.

5.1 Equality of distributions

The energy statistic is based on the Euclidean distance between pairs of sample elements.
The two-sample test statistic is

$$\mathcal{E}_{m,n} = \frac{mn}{m + n}\left(\frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\|X_i - Y_j\| - \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m}\|X_i - X_j\| - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|Y_i - Y_j\|\right).$$

The first term measures the average distance between the samples and the last two terms
measure the average distance within each sample. The significance of the energy test
statistic is assessed using a permutation test. In our implementation of the energy test, we used
1000 permutations.
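
As an illustration of the statistic just described, the sketch below computes the two-sample energy statistic in the form given by Szekely and Rizzo (2004): twice the average between-sample distance minus the two average within-sample distances, scaled by mn/(m + n). The function names and the toy data are hypothetical, and the broadcasting approach builds an m × n × d array, so it is only meant for small examples.

```python
import numpy as np

def mean_pairwise_distance(A, B):
    """Average Euclidean distance over all pairs (a, b) with a in A and b in B."""
    diff = A[:, None, :] - B[None, :, :]          # shape (len(A), len(B), d)
    return np.sqrt((diff ** 2).sum(axis=2)).mean()

def energy_statistic(X, Y):
    """Two-sample energy statistic for samples X (m x d) and Y (n x d)."""
    m, n = len(X), len(Y)
    between = mean_pairwise_distance(X, Y)        # average distance across samples
    within_x = mean_pairwise_distance(X, X)       # includes the zero self-distances
    within_y = mean_pairwise_distance(Y, Y)
    return m * n / (m + n) * (2 * between - within_x - within_y)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
Y = rng.standard_normal((30, 50)) + 1.0           # shifted second sample
e_same = energy_statistic(X, X)
e_diff = energy_statistic(X, Y)
```

A permutation p-value could then be obtained by recomputing `energy_statistic` on relabelled splits of the pooled sample, in the same way as for the DiProPerm statistics.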

For all simulations in Figure 7, the sample sizes are set to be unbalanced: m = 50, n =
150. Figure 7 compares the power of MD-t, DWD-t, and the energy test for testing equality
of distributions. The first panel shows the result of simulation S1. The standard Gaussian
and $t(5)^d$ both have mean zero but different covariances. Note that the signal in the
covariance grows stronger with dimension. In light of this, it is not surprising that the MD-t and
DWD-t do not perform as well as the energy test, which is more attuned to variance effects.
However, as the dimension increases all three tests attain full power.

The second panel of Figure 7 shows the results for simulation S2. All three tests perform
well with power increasing to 1 with dimension. Note that the mean effect is along the 45
degree line. The structure of $\Sigma_B$ has the implication that the directions with highest
variation are, for some constant c, (c, c, c, c, c, 0, . . . , 0), (0, 0, 0, 0, 0, c, c, c, c, c, 0, . . . , 0), and
so on. Thus the mean effect is further exaggerated by the covariance structure, making this a
rather unchallenging setting for all three methods.
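
The claim about the directions of highest variation can be checked directly. The sketch below builds a block-diagonal covariance with blocks that have ones on the diagonal and 0.2 elsewhere, taking the block size to be 5 as suggested by the five nonzero entries in the directions listed above; the function name, the small dimension d = 20, and the default arguments are assumptions of the sketch.

```python
import numpy as np

def block_covariance(d, block_size=5, rho=0.2):
    """Block-diagonal covariance with identical blocks (1 on the diagonal, rho elsewhere)."""
    if d % block_size:
        raise ValueError("d must be a multiple of block_size")
    B = np.full((block_size, block_size), rho)
    np.fill_diagonal(B, 1.0)
    return np.kron(np.eye(d // block_size), B)

Sigma = block_covariance(20)
eigvals = np.linalg.eigvalsh(Sigma)                # ascending eigenvalues

# Candidate direction (c, c, c, c, c, 0, ..., 0) with unit length.
v = np.zeros(20)
v[:5] = 1 / np.sqrt(5)
variance_along_v = v @ Sigma @ v
```

Within each block the constant vector is an eigenvector with eigenvalue 1 + 4(0.2) = 1.8, which is the largest eigenvalue of the matrix, so a mean shift spread evenly over the coordinates has a sizeable component in these high-variance directions.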

The result of simulation S3 is shown in the last panel of Figure 7. Here, both the
DiProPerm DWD-t and MD-t tests are seen to be more powerful than the energy test. This
is not surprising since by way of its construction, the energy test can be expected to have 
difficulty in separating Gaussian mixture data types. The MD-t has good performance but 
DWD-t has the best power because DWD was developed to handle Gaussian mixture data 
types. 

5.2 Equality of means 

Figure 7: Power comparison of DWD-t, MD-t and the energy test for testing equality of
distributions under the various simulation settings in Table 1.

In the RP test proposed by Lopes, Jacob and Wainwright, the data is first projected down
to a dimension low enough so that the regular Hotelling $T^2$ statistic may be applied (Lopes
et al., 2011). The projection matrix is a $k \times d$ matrix with iid $N(0, 1)$ entries, where k is the
dimension of the lower dimensional subspace. In our implementation of the RP method, we
follow the authors' recommendation and set the tuning parameter $k = \lfloor n/2 \rfloor$. The samples
are assumed to arise from Gaussian distributions with equal covariances. The resulting
statistic then follows an F distribution under the null of equal means. For all simulations
in Figure 8, the sample sizes are set to be balanced: m = 50, n = 50. The standard
multivariate Gaussian and the multivariate $t(5)^d$ both have mean zero, and thus the power
of MD-MD and RP should be around $\alpha = 0.1$ in simulation S1. The first panel of Figure
8 shows this is indeed the case. Note that if MD-MD or RP were to be used for testing
equality of distributions, neither would have power against alternatives such as in S1.
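
The projection step of the RP test can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: it draws a k × d Gaussian matrix, computes the pooled-covariance Hotelling statistic on the projected samples, and rescales it using the standard result that, for Gaussian data with equal covariances, (N - k - 1) T² / ((N - 2) k) follows an F(k, N - k - 1) distribution with N = m + n. The function name and the choice k = 25 in the usage lines are hypothetical, and k should be at least 2 and smaller than N - 2 for the pooled covariance to be invertible.

```python
import numpy as np

def random_projection_t2(X, Y, k, rng=None):
    """Hotelling T^2 on randomly projected data, with its F-scaled version."""
    rng = np.random.default_rng(rng)
    m, n = len(X), len(Y)
    d = X.shape[1]
    P = rng.standard_normal((k, d))               # k x d matrix with iid N(0, 1) entries
    Xp, Yp = X @ P.T, Y @ P.T                     # projected samples, k columns each
    diff = Xp.mean(axis=0) - Yp.mean(axis=0)
    # Pooled covariance of the projected data (equal covariances are assumed).
    Sp = ((m - 1) * np.cov(Xp, rowvar=False)
          + (n - 1) * np.cov(Yp, rowvar=False)) / (m + n - 2)
    t2 = m * n / (m + n) * diff @ np.linalg.solve(Sp, diff)
    f_stat = (m + n - k - 1) / ((m + n - 2) * k) * t2
    return t2, f_stat

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
Y = rng.standard_normal((50, 200))
t2, f_stat = random_projection_t2(X, Y, k=25, rng=1)
```

A p-value would be the upper tail probability of `f_stat` under the F(k, N - k - 1) reference distribution; no permutation step is involved, which is why the Gaussian and equal covariance assumptions matter for this test.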

In simulation S2, the RP method does not perform as well as MD-MD or DWD-MD. This
is perhaps because the DiProPerm tests pick up the mean effect more efficiently
than the RP method, which tries to sense random directions in very high dimensions. Note
that simulation S2 is a setting in which the MD statistic is powerful for either direction,
DWD or MD, as the mean effect is strong. Re-examining Figure 7, we see that DWD-MD
is more powerful than DWD-t, and MD-MD more powerful than MD-t, for
simulation S2. Recall that the covariance structure in S2 amplifies the mean effect. The
DiProPerm tests that use the two-sample t-statistic may have lower power than their MD
counterparts because the standardization in the t-statistic cancels out some of the effect.

In the Gaussian mixture S3 simulation, the DWD-MD and the RP test are both substantially
more powerful than the MD-MD test. In this setting, the direction of discrimination
is the first coordinate direction, but the direction of most variation is along the second
coordinate. Not surprisingly, MD-MD has trouble in this setting. The RP test, which uses
the Mahalanobis distance, is able to correct for this false signal in the second coordinate
direction. DWD-MD is seen to perform slightly better than the RP test. Again, DWD is
designed to work well in discriminating Gaussian mixture data types and this result matches
our expectation.

Figure 8: Power comparison of MD-MD versus the RP method for testing equality of means
under the various simulation settings in Table 1.

6 Application: Microarray data analysis 



The first application of DiProPerm to a real dataset can be found in Wichers et al. (2007).
There, DiProPerm was applied to an HDLSS dataset and used to find a statistically significant
difference between heart rates of rats among different treatment groups. In this section we
apply DiProPerm to a different kind of HDLSS data: gene expression microarray
data.

Two HDLSS datasets are examined. The first dataset is denoted UNCGEO and the
second UNCUP, following the naming convention of their source, which can be found at
http://peroulab.med.unc.edu/. The UNCGEO dataset consists of gene expression data
for 9674 genes measured on 50 breast cancer patients at UNC. The UNCUP dataset covers
the same set of genes measured on 80 breast cancer patients in another study at UNC.
We tested many different hypotheses of interest within each dataset. We highlight two
particular comparisons here that illustrate the main point: formal hypothesis testing
is an important component of visualization in high dimensions.

The UNCGEO patients are divided into two standard breast cancer subtypes: 1) Luminal
A versus 2) Luminal B, and the UNCUP patients into the groups: 1) Luminals (Luminal A
and Luminal B) versus 2) HER and Basal. Luminals have a very different gene expression
signature from HER and Basal. On the other hand, the difference between Luminal A and
Luminal B is less clear cut. For each dataset, we use DWD-t to test equality of distributions
between the gene expression in group 1 and group 2. Note that we have an HDLSS setting
here since the number of genes well exceeds the sample sizes in each subgroup.

Figure 9 shows the data projected onto DWD directions. The projections in the left
panel do not overlap at all whereas the projections in the right panel have a small amount
of overlap. These projection plots suggest that the separation is better for Luminal A vs.
Luminal B in the UNCGEO dataset than for Luminals vs. HER & Basal in the UNCUP
dataset. However, as previously seen in the toy example in Section 1.1, great care is needed
before drawing conclusions of this type.

Figure 9: One dimensional projection plots onto DWD directions for the UNCGEO dataset
and the UNCUP dataset. The separation in the projection plot for the UNCGEO dataset
is more visually pronounced than in the UNCUP dataset. We will rigorously assess this
visual result using DiProPerm.



Figure 10 displays the DiProPerm test results. Each dot represents the test statistic
resulting from a single permutation in the permutation test. We mark the position of the
original univariate t-statistic with a vertical dashed line. The empirical p-values show the
difference in the UNCGEO dataset is not significant while the difference in the UNCUP
dataset is very significant. (We also display the Gaussian fit p-value and Gaussian fit
z-score, two other types of "p-values" described in Section C.3 of the Supplement.) This result
on a real world dataset parallels what we saw on the simulated toy dataset in Section 1.1:
what may seem to be a visually striking separation in lower dimensional visualizations
could well be an artifact of over-fitting or sampling variation.

Figure 10: DWD-t test result for the UNCGEO (left) and UNCUP (right) datasets. In the
UNCGEO study (left), the difference between the Luminal A and Luminal B subgroups
is not significant. In the UNCUP study (right), the difference between the Luminals and
HER & Basal subgroups is very significant. This is surprising because the projection plots
in Figure 9 suggest the contrary.



7 Matlab Software

Matlab software for DiProPerm is available at http://www.unc.edu/~marron/marron_software.html.

A Proofs

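Lemma 1. Let $X_1, \ldots, X_m$ be a sample from the $d$-variate Gaussian distribution $N(\mu_x, \sigma_x^2 I_d)$
and $Y_1, \ldots, Y_n$ be an independent sample from the $d$-variate Gaussian distribution $N(\mu_y, \sigma_y^2 I_d)$
where $\sigma_x \neq \sigma_y$. Let $\tilde X_i = X_i'(\bar X - \bar Y)$. Let $\bar{\tilde X}_{1:k-1}$ be the sample mean of $\tilde X_1, \ldots, \tilde X_{k-1}$.
Under $\mu_x = \mu_y$, we have, for $k = 2, \ldots, m$,

$$\frac{d^{-1/2}\,(\tilde X_k - \bar{\tilde X}_{1:k-1})}{\left\{\frac{k}{k-1}\,\sigma_x^2\,(\sigma_x^2/m + \sigma_y^2/n)\right\}^{1/2}} \xrightarrow{d} N(0, 1) \quad \text{as } d \to \infty.$$

Similarly we have, for $k = 2, \ldots, n$,

$$\frac{d^{-1/2}\,(\tilde Y_k - \bar{\tilde Y}_{1:k-1})}{\left\{\frac{k}{k-1}\,\sigma_y^2\,(\sigma_x^2/m + \sigma_y^2/n)\right\}^{1/2}} \xrightarrow{d} N(0, 1) \quad \text{as } d \to \infty.$$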

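Proof. We can write $\tilde X_k - \bar{\tilde X}_{1:k-1}$ as a sum of products

$$\tilde X_k - \bar{\tilde X}_{1:k-1} = \sum_{p=1}^{d} (X_k - \bar X_{1:k-1})^{(p)} (\bar X - \bar Y)^{(p)} \tag{8}$$

where $X^{(p)}$ simply refers to the $p$-th component in the $d$-dimensional vector $X$. The expectation
of the summands in (8) is zero:

$$E\left[(X_k - \bar X_{1:k-1})^{(p)} (\bar X - \bar Y)^{(p)}\right] = E\left(X_k^{(p)} \bar X^{(p)}\right) - E\left(\bar X_{1:k-1}^{(p)} \bar X^{(p)}\right) = 0.$$

Next we look at the variance of the summands. Recall that for Gaussian data, zero covariance
is equivalent to independence. We know the covariance between $(X_k - \bar X_{1:k-1})^{(p)}$ and
$(\bar X - \bar Y)^{(p)}$ is zero since the expectation of the latter is zero and the expectation of the
product was shown above to be zero as well. Thus each summand in (8) is the product of
two independent variables. The variance of a product of independent variables (see ?? for
a derivation), $U$ and $V$, is

$$(EU)^2\,\mathrm{Var}(V) + (EV)^2\,\mathrm{Var}(U) + \mathrm{Var}(U)\,\mathrm{Var}(V). \tag{9}$$

Thus we have

$$\mathrm{Var}\left[(X_k - \bar X_{1:k-1})^{(p)} (\bar X - \bar Y)^{(p)}\right] = \mathrm{Var}\left[(X_k - \bar X_{1:k-1})^{(p)}\right] \mathrm{Var}\left[(\bar X - \bar Y)^{(p)}\right] = \frac{k}{k-1}\,\sigma_x^2\,(\sigma_x^2/m + \sigma_y^2/n).$$

By the Central Limit Theorem, we have

$$\frac{d^{-1/2}\,(\tilde X_k - \bar{\tilde X}_{1:k-1})}{\left\{\frac{k}{k-1}\,\sigma_x^2\,(\sigma_x^2/m + \sigma_y^2/n)\right\}^{1/2}} \xrightarrow{d} N(0, 1) \quad \text{as } d \to \infty.$$

□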
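For completeness, formula (9) for the variance of a product follows from a short second-moment calculation. The sketch below assumes only that $U$ and $V$ are independent with finite second moments:

```latex
\begin{align*}
\mathrm{Var}(UV) &= E(U^2 V^2) - \{E(UV)\}^2 \\
                 &= E(U^2)\,E(V^2) - (EU)^2 (EV)^2 \\
                 &= \{\mathrm{Var}(U) + (EU)^2\}\{\mathrm{Var}(V) + (EV)^2\} - (EU)^2 (EV)^2 \\
                 &= (EU)^2\,\mathrm{Var}(V) + (EV)^2\,\mathrm{Var}(U) + \mathrm{Var}(U)\,\mathrm{Var}(V).
\end{align*}
```

When $EU = EV = 0$, as for the two factors of each summand in (8) under $\mu_x = \mu_y$, only the last term remains, which is the step used in the proof of Lemma 1.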

Lemma 2. Let $X_1, \ldots, X_m$ be a sample from the $d$-variate Gaussian distribution $N(\mu_x, \sigma_x^2 I_d)$
and $Y_1, \ldots, Y_n$ be an independent sample from the $d$-variate Gaussian distribution $N(\mu_y, \sigma_y^2 I_d)$
where $\sigma_x \neq \sigma_y$. Let $\pi$ be a permutation of $\{1, \ldots, N = m + n\}$. Let $Z_\pi = (\bar Z_{\pi(1:m)} - \bar Z_{\pi(m+1:N)})$
be the MD direction trained on the permuted labels determined by $\pi$. We have,
for $i = 1, \ldots, m$,

$$E\left((Z_{\pi(i)} - \bar Z_{\pi(1:m)})^{(k)}\, Z_\pi^{(k)}\right)$$

is non-zero. Similarly, for $i = m + 1, \ldots, N$, we have

$$E\left((Z_{\pi(i)} - \bar Z_{\pi(m+1:N)})^{(k)}\, Z_\pi^{(k)}\right)$$

is non-zero.

Proof. We prove the first statement. The second can be shown in a similar fashion. Let
$P(n, k)$ denote the number of $k$-permutations of $n$, i.e.

$$P(n, k) = n \cdot (n - 1) \cdot (n - 2) \cdots (n - k + 1).$$

We have, for $i = 1, \ldots, m$ and $k = 1, \ldots, d$,

$$E\left((Z_{\pi(i)} - \bar Z_{\pi(1:m)})^{(k)}\, Z_\pi^{(k)}\right) = E\left((Z_{\pi(i)} - \bar Z_{\pi(1:m)})^{(k)}\, (\bar Z_{\pi(1:m)} - \bar Z_{\pi(m+1:N)})^{(k)}\right)$$



(k) ^2 



i^^wdM m 



m 

VariZ ,\) + U^ m n\ 

m m — 1 TT[L.m) 
— variZ ^ 



2 -. I m— 1 

= T7{— 2 — V^("i- l,r)P{n,m - r)[ral + {m - r)al]} 

^ r=0 

+ ttI— ^ — / P{n — l,r)P(m,m — r)[ral + (m — r)al]} 

iV m m^ tt;2 "^ 

r=0 
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where wi and W2 are the weights 

$$w_1 := \sum_{r=0}^{m-1} P(m-1, r)\,P(n, m-r) \quad \text{and} \quad w_2 := \sum_{r=0}^{n-1} P(n-1, r)\,P(m, m-r).$$

Thus if $\sigma_x \neq \sigma_y$, we have that $E\left((Z_{\pi(i)} - \bar Z_{\pi(1:m)})^{(k)}\, Z_\pi^{(k)}\right)$ is nonzero. □

Lemma 3. Let $Z_1, Z_2$ be two random variables in $\mathbb{R}^d$ such that $Z_1^{(k)} Z_2^{(k)}$ are i.i.d. for
$k = 1, \ldots, d$ and $E(Z_1^{(1)} Z_2^{(1)})$ exists and is finite. Then

$$\frac{1}{d^2}(Z_1 \cdot Z_2)^2 \to \left[E(Z_1^{(1)} Z_2^{(1)})\right]^2 \quad \text{in probability.}$$

Proof. By the Law of Large Numbers, we have

$$\frac{1}{d}(Z_1 \cdot Z_2) \to E(Z_1^{(1)} Z_2^{(1)}) \quad \text{in probability.}$$

By the Continuous Mapping Theorem, we have

$$\frac{1}{d^2}(Z_1 \cdot Z_2)^2 \to \left[E(Z_1^{(1)} Z_2^{(1)})\right]^2 \quad \text{in probability.}$$

□

Now we have all the necessary ingredients to prove Theorem 2 in Section 4.2.



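Proof of Theorem 2. To prove the first part of Theorem 2, we decompose $s_x^2$ and $s_y^2$ into
sums of independent variables. Let $\bar{\tilde X}_{k-1}$ be the sample mean of the first $k - 1$ projections
$\tilde X_1, \ldots, \tilde X_{k-1}$ (the projected values of Lemma 1). We will write $s_x^2$ in a recursive fashion.
Define $s_1^2 = 0$. We will use the following recursive formula to define $s_k^2$ for $k = 2, \ldots, m$:

$$(k - 1)\,s_k^2 = (k - 2)\,s_{k-1}^2 + \frac{k-1}{k}\,(\tilde X_k - \bar{\tilde X}_{k-1})^2 \tag{10}$$

Since $s_{k-1}^2$ is independent of $(\tilde X_k - \bar{\tilde X}_{k-1})^2$, this recursive viewpoint allows us to decompose
$s_x^2 = s_m^2$ into a sum of independent terms. Using the result in Lemma 1 and the second-order
Delta method, we have

$$\frac{d^{-1}\,(\tilde X_k - \bar{\tilde X}_{k-1})^2}{\frac{k}{k-1}\,\sigma_x^2\,(\sigma_x^2/m + \sigma_y^2/n)} \to \chi^2(1) \quad \text{as } d \to \infty \tag{11}$$

Inputting expression (11) into the recursion defined in (10) and exploiting the independence
of the individual terms in $s_x^2$, we get

$$\frac{1}{d}\,s_x^2 \to \frac{1}{m-1}\,\sigma_x^2 \left(\frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n}\right) \chi^2(m - 1) \quad \text{as } d \to \infty.$$

Similarly, we can show for the sample of projections $\tilde Y_1, \ldots, \tilde Y_n$,

$$\frac{1}{d}\,s_y^2 \to \frac{1}{n-1}\,\sigma_y^2 \left(\frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n}\right) \chi^2(n - 1) \quad \text{as } d \to \infty.$$

Thus we have

$$\frac{1}{d}\,S_{m,n}(Z) = \frac{1}{d}\left(\frac{s_x^2}{m} + \frac{s_y^2}{n}\right) \to \frac{1}{m-1}\,\frac{\sigma_x^2}{m}\left(\frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n}\right)\chi^2(m - 1) + \frac{1}{n-1}\,\frac{\sigma_y^2}{n}\left(\frac{\sigma_x^2}{m} + \frac{\sigma_y^2}{n}\right)\chi^2(n - 1).$$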
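Recursion (10) is the usual one-pass update for the sample variance. As a quick sanity check, the Python sketch below applies it to an arbitrary list of numbers and compares the result with a direct computation. The function name `recursive_variance` and the test data are ours and are not part of the DiProPerm software.

```python
import random
import statistics

def recursive_variance(values):
    """Sample variance built up term by term, following recursion (10):
    (k-1) s_k^2 = (k-2) s_{k-1}^2 + ((k-1)/k) (x_k - mean_{k-1})^2, with s_1^2 = 0.
    """
    mean = values[0]   # mean of the first k-1 values
    scaled = 0.0       # holds (k-1) * s_k^2
    for k in range(2, len(values) + 1):
        x_k = values[k - 1]
        scaled += (k - 1) / k * (x_k - mean) ** 2   # one new term per observation
        mean += (x_k - mean) / k                    # update to the mean of the first k values
    return scaled / (len(values) - 1)

rng = random.Random(0)
data = [rng.gauss(0.0, 2.0) for _ in range(25)]
print(recursive_variance(data), statistics.variance(data))
```

Each pass through the loop adds exactly one of the terms that the proof treats as asymptotically independent scaled $\chi^2(1)$ variables.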

For the second part of Theorem 2, we expand the sample variance of the projected
values in the permuted group as follows:

$$\frac{1}{m-1}\sum_{i=1}^{m}\left(Z_{\pi(i)} \cdot Z_\pi - \bar Z_{\pi(1:m)} \cdot Z_\pi\right)^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left((Z_{\pi(i)} - \bar Z_{\pi(1:m)}) \cdot (\bar Z_{\pi(1:m)} - \bar Z_{\pi(m+1:N)})\right)^2.$$

Lemma 2 shows that $E\left((Z_{\pi(i)} - \bar Z_{\pi(1:m)})^{(k)} (\bar Z_{\pi(1:m)} - \bar Z_{\pi(m+1:N)})^{(k)}\right)$ is nonzero. Now apply Lemma
3 with $Z_1 = (Z_{\pi(i)} - \bar Z_{\pi(1:m)})$ and $Z_2 = (\bar Z_{\pi(1:m)} - \bar Z_{\pi(m+1:N)})$ to see that this sample variance,
scaled by $1/d^2$, converges in probability to a nonzero constant. A similar argument can be applied
to the sample variance of the second permuted group.
Combining these results, it immediately follows that $\frac{1}{d^2} S_{m,n}(Z_\pi)$ converges in probability
to a nonzero constant.

B MD-scaled MD



We established in Section 4.1 that under certain conditions, the MD-MD test is valid when
sample sizes are balanced. Under these same conditions, however, MD-MD is no longer a valid
test when sample sizes are unbalanced. Here we propose a modification of MD-MD,
called MD-scaled MD, that is asymptotically valid, as $m, n \to \infty$ for fixed $d$, for equality of
means when covariances are unequal and sample sizes are unbalanced.

We have chosen the classical asymptotic regime here to take advantage of the following
result. Janssen proved that the permutation test for equality of means based on the studentized
statistic,

$$m^{1/2}(\bar X - \bar Y)\,/\,(s_x^2 + \tfrac{m}{n}\,s_y^2)^{1/2} \tag{12}$$

where $s_x^2$ and $s_y^2$ are the standard unbiased estimators of $\sigma_x^2$ and $\sigma_y^2$, is asymptotically valid
as $m, n \to \infty$ for the univariate case (Janssen 1997). Janssen's result easily extends to
the multivariate case if we assume a spherical covariance structure. Let $X_1, \ldots, X_m$ be a
sample from a $d$-variate distribution with mean and covariance $(\mu_x, \sigma_x^2 I_d)$ and $Y_1, \ldots, Y_n$ be
an independent sample with mean and covariance $(\mu_y, \sigma_y^2 I_d)$. We propose the MD-scaled
MD DiProPerm test whereby the MD direction is used in Step 1 of DiProPerm and a scaled
MD statistic as in Equation (12) is used in Step 2. The MD-scaled MD statistic is

$$T_{m,n}(Z)\,/\,\{s_x^2/m + s_y^2/n\}^{1/2} \tag{13}$$

where $T_{m,n}(Z)$ is as in Section 4.1. The asymptotic validity of the MD-scaled MD statistic
(as $m, n \to \infty$) follows immediately from Janssen's result. Note that normality is not an
assumption here.
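To make the link between (12) and (13) concrete, the Python sketch below computes both forms on a pair of projected samples and shows that they coincide. It assumes that the numerator $T_{m,n}(Z)$ is the difference of the projected sample means and that $s_x^2$, $s_y^2$ are the unbiased sample variances of the projected values; the function names are ours.

```python
import math
import statistics

def scaled_md_statistic(x_proj, y_proj):
    """Form (13): difference of projected means over sqrt(s_x^2/m + s_y^2/n)."""
    m, n = len(x_proj), len(y_proj)
    diff = statistics.fmean(x_proj) - statistics.fmean(y_proj)
    s2x = statistics.variance(x_proj)   # unbiased sample variances
    s2y = statistics.variance(y_proj)
    return diff / math.sqrt(s2x / m + s2y / n)

def janssen_form(x_proj, y_proj):
    """Form (12): m^(1/2) (xbar - ybar) / (s_x^2 + (m/n) s_y^2)^(1/2)."""
    m, n = len(x_proj), len(y_proj)
    diff = statistics.fmean(x_proj) - statistics.fmean(y_proj)
    s2x = statistics.variance(x_proj)
    s2y = statistics.variance(y_proj)
    return math.sqrt(m) * diff / math.sqrt(s2x + (m / n) * s2y)

# Small unbalanced example with made-up projected values.
x_proj = [1.0, 2.0, 3.0]
y_proj = [0.0, 0.0, 1.0, 1.0]
print(scaled_md_statistic(x_proj, y_proj), janssen_form(x_proj, y_proj))
```

The two functions differ only by multiplying numerator and denominator by $m^{1/2}$, which is why Janssen's univariate result carries over directly.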




(a) MD-MD: Unbalanced    (b) MD-scaled MD: Unbalanced

Figure 11: Power surface of the MD-MD and the MD-scaled MD tests for testing equality of
means for distributions $F_1 = N(\mu_1, \sigma_1^2 I_d)$ and $F_2 = N(0, I_d)$. When sample sizes are
unbalanced, the MD-scaled MD test attains the correct level while the MD-MD does not.



We study the empirical power of the MD-MD and MD-scaled MD tests for testing equality
of means when sample sizes are unbalanced. We set $m = 50$, $n = 100$ and make $m$ draws
from $F_1 = N(\mu_1, \sigma_1^2 I_d)$ and $n$ draws from $F_2 = N(0, I_d)$ for $d = 500$. The sample sizes and
dimension are chosen to reflect an HDLSS setting. The significance level is set at $\alpha = 0.05$.
Power is estimated using 1000 Monte Carlo simulations and displayed using a color spectrum
from cool to warm, corresponding to the range 0 to 1.



Figure 11, as in the figures in Section 4.3, displays the simulated power surface of MD-MD
and MD-scaled MD. We see that when sample sizes are unbalanced and covariances
unequal ($\sigma_1^2 \neq 1$), MD-MD does not attain the correct level. Indeed MD-MD will reject
increasingly often as the signal in $\sigma_1^2$ grows. On the other hand, we see from Figure 11b
that the MD-scaled MD test attains the correct level under unbalanced sample sizes. This
simulated power study also suggests that the asymptotics for the MD-scaled MD test are in
effect for relatively small sample sizes and a much larger dimension.

C Additional Implementation Options 

C.1 Direction

The following binary linear classifiers are among many possible choices for the direction
vector used in Step 1 of DiProPerm and all are implemented in the DiProPerm software:

1. The Mean Difference (MD) method is a simple binary linear classifier, also called the centroid
method (Hastie et al., 2003), where points are assigned to the class whose centroid
is closest. The normal vector to the separating hyperplane is the unit vector in the
direction of the line segment connecting the centroids of each class, $(\bar X - \bar Y)$.

2. Fisher Linear Discrimination (FLD) was an early binary linear classification method;
see Chapter 11 of Mardia et al. (1979) for an introduction. FLD seeks a separation
that maximizes the between sum-of-squares of the two classes while minimizing the
within sum-of-squares of each class. The normal vector to the separating hyperplane
is the unit vector in the direction of $W^{-1}(\bar X - \bar Y)'$ where $W$ is the $d \times d$ matrix

$$W = \sum_{i=1}^{m}(X_i - \bar X)(X_i - \bar X)' + \sum_{i=1}^{n}(Y_i - \bar Y)(Y_i - \bar Y)'.$$

3. Support Vector Machine (SVM) is a popular binary linear classification method that
minimizes training error while maximizing the margin between the two classes. See
Hastie et al. (2003) for a good introduction.

4. Distance Weighted Discrimination (DWD) is a binary linear classifier similar to SVM
except each data point has some weight in the final classifier (Marron et al., 2007).
DWD better avoids the data piling problem exhibited by SVM in high dimensions.

5. Maximal Data Piling (MDP) is a binary linear classifier such that the projections
of the data points from each class onto its normal direction vector have two distinct
values (Ahn and Marron, 2010).



Notice that we have not included any PCA directions on this list. This is because PCA is
tailored to find directions that show maximal variation, which is different from our objective
of finding directions that show separation between the two samples. A more serious
disadvantage of using PCA as the direction in Step 1 of DiProPerm is that the univariate
two-sample test statistic calculated in Step 2 of DiProPerm would be invariant under
relabelings.

C.2 Projection and univariate statistic 

In the second step of DiProPerm, we project the data onto the direction from Step 1 and
compute a univariate two-sample statistic on the projected values. Large values of the test
statistic indicate departure from the null hypothesis. The following univariate two-sample
statistics are among many reasonable choices for the DiProPerm procedure and all are
implemented in the DiProPerm software:

1. Two-sample t statistic

2. Difference of sample means

3. Difference of sample means scaled, as in Equation (13)

4. Difference of sample medians

5. Difference of sample medians, divided by the median absolute deviation

6. Area Under the Curve (AUC), from the Receiver Operating Characteristic (ROC) curve

7. Paired sampling t-statistic
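Several of the statistics in this list are easy to write down for a pair of projected samples. The Python sketch below is illustrative only and is not the DiProPerm software: the function names are ours, the t statistic is assumed to use the unpooled standard error $\{s_x^2/m + s_y^2/n\}^{1/2}$, and the AUC is computed as the proportion of pairs in which the first sample's value exceeds the second's, counting ties as one half.

```python
import math
import statistics

def mean_difference(x, y):
    """Difference of sample means of the projected values."""
    return statistics.fmean(x) - statistics.fmean(y)

def two_sample_t(x, y):
    """Two-sample t statistic with an unpooled standard error (an assumption)."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return mean_difference(x, y) / se

def median_difference(x, y):
    """Difference of sample medians."""
    return statistics.median(x) - statistics.median(y)

def auc(x, y):
    """Proportion of (x_i, y_j) pairs with x_i > y_j, ties counted as 1/2."""
    wins = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return wins / (len(x) * len(y))

# Made-up projected values with complete separation.
x_proj = [3.0, 4.0, 5.0]
y_proj = [0.0, 1.0, 2.0]
for stat in (two_sample_t, mean_difference, median_difference, auc):
    print(stat.__name__, stat(x_proj, y_proj))
```

Any of these functions can be plugged into the permutation step unchanged, since each takes the two projected samples and returns a single number.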

It is of interest to note that the classical Hotelling $T^2$ statistic is a special case of the
general DiProPerm framework. The FLD direction vector and the difference of sample
means combination gives the statistic $(\bar X - \bar Y)W^{-1}(\bar X - \bar Y)'$. This is in fact the Hotelling
$T^2$ test statistic scaled by a factor of $\frac{n}{n_1 n_2 (n-2)}$. To see this, recall the Hotelling $T^2$ statistic
is

$$T^2 = \frac{n_1 n_2}{n}\,(\bar X - \bar Y)\,S_u^{-1}\,(\bar X - \bar Y)'$$

where

$$S_u = \frac{\sum_{i=1}^{n_1}(X_i - \bar X)(X_i - \bar X)' + \sum_{i=1}^{n_2}(Y_i - \bar Y)(Y_i - \bar Y)'}{n - 2} = \frac{W}{n - 2}.$$

The MD-FLD statistic is $\frac{n}{n_1 n_2 (n-2)}\,T^2$.
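The scaling relation can be checked numerically. The Python sketch below does so for a small made-up two-dimensional example, using the closed-form inverse of a 2 x 2 matrix so that no linear algebra library is needed. The helper names and the data are ours, and $n = n_1 + n_2$ is taken to be the total sample size.

```python
def mean_vec(rows):
    """Coordinate-wise mean of a list of 2-d points."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(2)]

def scatter(rows):
    """2 x 2 sum of outer products of the centered points."""
    c = mean_vec(rows)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        u = [r[0] - c[0], r[1] - c[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += u[a] * u[b]
    return s

def quad_form_inv(v, a):
    """v A^{-1} v' for a 2 x 2 matrix A, via the closed-form inverse."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    inv = [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]
    return sum(v[i] * inv[i][j] * v[j] for i in range(2) for j in range(2))

x = [(1.0, 2.0), (2.0, 0.5), (3.0, 1.5), (0.5, 1.0)]
y = [(2.0, 3.0), (3.5, 2.0), (4.0, 4.5)]
n1, n2 = len(x), len(y)
n = n1 + n2

delta = [a - b for a, b in zip(mean_vec(x), mean_vec(y))]
sx, sy = scatter(x), scatter(y)
w = [[sx[i][j] + sy[i][j] for j in range(2)] for i in range(2)]   # within-class scatter W
s_u = [[w[i][j] / (n - 2) for j in range(2)] for i in range(2)]    # pooled covariance S_u

t2 = n1 * n2 / n * quad_form_inv(delta, s_u)   # Hotelling T^2
fld_md = quad_form_inv(delta, w)               # (xbar - ybar) W^{-1} (xbar - ybar)'
print(fld_md, n / (n1 * n2 * (n - 2)) * t2)
```

The two printed numbers agree up to floating-point rounding, as the algebra above requires.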
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C.3 Permutation 

In the final step of DiProPerm, an approximate permutation test is conducted to assess the
significance of the test statistic in Step 2. Our permutation test is approximate because
we perform a large number of random rearrangements of the labels on the observed data
points, rather than all possible rearrangements. There are three kinds of indicators we
commonly use and all are implemented in the DiProPerm software:

1. Empirical p-value: this is calculated as the proportion of the rearrangement test statistics
that exceed the original test statistic. The empirical p-value has the disadvantage
of often being zero. We may wish to compare two separations to see which is more
significant. This motivates the next quantity.

2. Gaussian fit p-value: we fit a Gaussian distribution to the permutation test statistics
and, based on this fit, calculate the percentage of rearrangement test statistics that would
exceed the original test statistic. (The term p-value is used loosely here.) We do this
not because we believe the permutation statistics are actually Gaussian, but because
this provides a basis on which we can compare two DiProPerm results. In certain
settings where the Gaussian fit p-value may suffer from round-off error, we use the
next quantity as an alternative.

3. z-score: we fit a Gaussian distribution to the permutation test statistics and calculate
the corresponding z-score of the original test statistic with respect to the fitted
distribution.

When interpreting the results of DiProPerm tests, it is generally useful to print all three
indicators. When it is non-zero, the empirical p-value is the most interpretable. When it
is zero, we next look to the Gaussian fit p-value. Finally, if the Gaussian fit p-value suffers
from round-off error, the z-score is preferable.
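The three steps and the three indicators can be put together in a short Python sketch. It is a minimal illustration of the MD-MD variant under stated assumptions, not the Matlab DiProPerm software: the function names, the default of 200 random relabelings, and the simulated data are ours. The direction is retrained on each relabeling.

```python
import math
import random
import statistics

def md_direction(x, y):
    """Unit vector along the difference of the two class centroids."""
    d = len(x[0])
    diff = [statistics.fmean([r[p] for r in x]) - statistics.fmean([r[p] for r in y])
            for p in range(d)]
    norm = math.sqrt(sum(v * v for v in diff))   # assumed non-zero
    return [v / norm for v in diff]

def md_md_statistic(x, y):
    """Project both samples onto the MD direction, then take the mean difference."""
    w = md_direction(x, y)
    px = [sum(a * b for a, b in zip(row, w)) for row in x]
    py = [sum(a * b for a, b in zip(row, w)) for row in y]
    return statistics.fmean(px) - statistics.fmean(py)

def diproperm_md_md(x, y, n_perm=200, seed=0):
    """Approximate permutation test with the three indicators of Section C.3."""
    rng = random.Random(seed)
    observed = md_md_statistic(x, y)
    pooled, m = x + y, len(x)
    perm_stats = []
    for _ in range(n_perm):
        shuffled = rng.sample(pooled, len(pooled))            # random relabeling
        perm_stats.append(md_md_statistic(shuffled[:m], shuffled[m:]))
    empirical_p = sum(s > observed for s in perm_stats) / n_perm
    fit = statistics.NormalDist(statistics.fmean(perm_stats), statistics.stdev(perm_stats))
    return {
        "statistic": observed,
        "empirical_p": empirical_p,
        "gaussian_p": 1.0 - fit.cdf(observed),   # can round to zero for large z
        "z_score": (observed - fit.mean) / fit.stdev,
    }

# Simulated HDLSS example: 10 + 10 points in 50 dimensions with a mean shift.
rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(10)]
y = [[rng.gauss(1.5, 1.0) for _ in range(50)] for _ in range(10)]
result = diproperm_md_md(x, y)
print(result)
```

With a strong mean shift the empirical p-value is typically zero and the Gaussian fit p-value may round to zero as well, which is the situation in which the z-score is the most informative of the three indicators.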

D HDLSS calculations 

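Let $X \sim N(0, \sigma_x^2 I_d)$ and $Y \sim N(0, \sigma_y^2 I_d)$. We will study the asymptotic behavior of the
distance between $X$ and $Y$. We have simply by definition

$$\|X - Y\|^2 = \sum_{k=1}^{d}\left(X^{(k)} - Y^{(k)}\right)^2.$$

Then by the Central Limit Theorem,

$$\sqrt{d}\left(\frac{\|X - Y\|^2}{d} - (\sigma_x^2 + \sigma_y^2)\right) \to N\left(0,\ 2(\sigma_x^2 + \sigma_y^2)^2\right)$$

as $d \to \infty$. Applying the Delta Method, we get

$$\sqrt{d}\left(\frac{\|X - Y\|}{\sqrt{d}} - \sqrt{\sigma_x^2 + \sigma_y^2}\right) = O_p(1)$$

and thus

$$\|X - Y\| = \sqrt{(\sigma_x^2 + \sigma_y^2)\,d} + O_p(1).$$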
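The final statement says that the distance grows like $\sqrt{d}$ while its fluctuation around $\sqrt{(\sigma_x^2 + \sigma_y^2)d}$ stays bounded in probability. A small simulation illustrates this. The sketch below is ours and assumes independent $X$ and $Y$; the values $\sigma_x = 1$, $\sigma_y = 2$, the dimensions, and the number of replications are arbitrary choices.

```python
import math
import random
import statistics

def distance_gap(d, sigma_x, sigma_y, rng):
    """||X - Y|| minus sqrt((sigma_x^2 + sigma_y^2) d) for one independent draw."""
    sq = 0.0
    for _ in range(d):
        sq += (rng.gauss(0.0, sigma_x) - rng.gauss(0.0, sigma_y)) ** 2
    return math.sqrt(sq) - math.sqrt((sigma_x ** 2 + sigma_y ** 2) * d)

rng = random.Random(1)
summary = {}
for d in (100, 1000, 5000):
    gaps = [distance_gap(d, 1.0, 2.0, rng) for _ in range(50)]
    summary[d] = (statistics.fmean(gaps), statistics.stdev(gaps))
    print(d, summary[d])
```

The spread of the gap should stay of the same order across the three dimensions even though the distance itself increases with $d$.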
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